Running COMETS Analytics locally

Ewy Mathé, Ella Temprosa

2020-12-09

Introduction

COMETS Analytics support all cohort-specific analyses of the COMETS consortium. This collaborative work is done via the COMETS harmonization group activities. For more information, see the [COMETS website] (http://epi.grants.cancer.gov/comets/).

Data Input Format

The required input file shoud be in excel format with the following 5 sheets:

  1. Metabolites - from harmonized metabolites output
  2. SubjectMetabolites - abundances in columns and subject in rows
  3. SubjectData - other exposure and adjustment variables
  4. VarMap - maps the variables needed to conduct the cohort specific analysis. Specify the name of variables under CohortVariable column. if the VarReference has the same name in the cohort, leave blank
  5. Models - models used to conduct COMETS analysis. Outcomes, exposures and adjustment can specify multiple covariates delimited by spaces (ie: age bmi).
  6. Options - optional sheet containing model specific options

An example input file is available HERE.

Analysis Workflow for correlation analysis

1. Load Data

The first step of the analysis is to load in the data with the readCOMETSinput() function. Input for this function is an Excel spreadsheet, per the description above.

# Retrieve the full path of the input data
dir <- system.file("extdata", package="COMETS", mustWork=TRUE)
csvfile <- file.path(dir, "cometsInputAge.xlsx")

# Read in and process the input data
exmetabdata <- COMETS::readCOMETSinput(csvfile)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
## [1] "Metabolites sheet is read in"
## [1] "SubjectMetabolites sheet is read in"
## [1] "SubjectData sheet is read in"
## [1] "VarMap sheet is read in"
## [1] "Models sheet is read in"
## [1] "Options sheet is read in"
## [1] "There are 14 categorical variables"
## [1] "Running Integrity Check..."
## Joining, by = "id"
## Joining, by = "hmdb_id"
## [1] "Input data has passed QC (metabolite and sample names match in all input files)"

To plot some the distribution of variances for each metabolite:

COMETS::plotVar(exmetabdata,titlesize=12)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

To plot the distribution of minimum/missing values:

COMETS::plotMinvalues(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.

2. Get Model Data

There are 2 ways to specify your model, batch or interactive. In Batch mode, models are specified in your input file Models sheet. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run. The following call defines the “1 Gender adjusted” model from the Models sheet in the input file to be run.

exmodeldata <- COMETS::getModelData(exmetabdata,modlabel="1 Gender adjusted")

In Interactive mode, models are specified as parameters. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run.
The following call defines the model with age and bmi_grp as the exposure variables, and includes only the subjects with age > 40 and bmi_grp > 2.

exmodeldata2 <- COMETS::getModelData(exmetabdata, modelspec="Interactive",
    colvars=c("age","bmi_grp"), where=c("age>40","bmi_grp>2"))
## [1] "Analysis will run on 'All metabolites'"
## [1] "Filtering subjects according to the rule(s) age>40 & bmi_grp>2. 279 of 1000 are retained"

3. Run Simple Unstratified Correlation Analysis

The runModel() function is the main function for running a single model, and by default, a correlation analysis is performed.

excorrdata  <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")

The output of the correlation analysis can then be compiled and output to an Excel file with the following function:

COMETS::OutputListToExcel(filename="DPP_corr.xlsx", excorrdata)

To view the first 3 lines of the correlation analysis output, simply type:

COMETS::showCorr(excorrdata,nlines=3)
##   run                   outcomespec exposurespec term        corr     p.value
## 1   1 _1_2_3_benzenetriol_sulfate_2          age  age 0.164624501 0.005846722
## 2   2      _1_2_dipalmitoylglycerol          age  age 0.068903188 0.251337451
## 3   3              _1_2_propanediol          age  age 0.001667259 0.977882521

To display the heatmap of the resulting correlation matrix, use the showheatmap function.

COMETS::showHeatmap(excorrdata)




To display the hierarchical clustering of the resulting correlation matrix, use the showHClust function. This diplay requires at least 2 rows and 2 columns in the correlation matrix.

exmodeldata<-COMETS::getModelData(exmetabdata,modelspec = "Interactive",colvars = c("bmi_grp","age"))
## [1] "Analysis will run on 'All metabolites'"
excorrdata  <- COMETS::runModel(exmodeldata,exmetabdata,"DPP")
COMETS::showHClust(excorrdata, showticklabels=FALSE)

Results can be written to an output Excel file with the following command:

COMETS::OutputListToExcel("Model1.xlsx", excorrdata)

4. Run Stratified Correlation Analysis

A stratified correlation analysis can be performed by specifiying stratification variables in the call to getModelData(). If more than one stratification variable is specified, then the strata will be defined by all unique combinations of the stratification variables. The following call will define a model stratified by race_grp.

  exmodeldata2 <- COMETS::getModelData(exmetabdata,modelspec="Interactive",
                   rowvars=c("lactose","lactate"),
                colvars=c("age","bmi_grp"),strvars="race_grp")

The stratified correlation analysis is run by calling the runModel() function.

  excorrdata2  <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")

5. Linear regression with lm

Call getModelData() to define a model which adjusts for age group, has lactose and lactate as outcome variables, and has age and bmi group as the exposure variables.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   rowvars=c("lactose","lactate"), colvars=c("age","bmi_grp"))

To run a linear regression using the lm function, a list of options must be passed into runModel() with the model option set to “lm”.

  lm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="lm"))
  print(lm_results)
## $ModelSummary
##   run cohort        spec model outcomespec exposurespec wald.pvalue  r.squared
## 1   1    DPP Interactive           lactose          age  0.16451267 0.01467966
## 2   2    DPP Interactive           lactate          age  0.60149903 0.00165591
## 3   3    DPP Interactive           lactose      bmi_grp  0.02387982 0.02206445
## 4   4    DPP Interactive           lactate      bmi_grp  0.00166282 0.01641147
##   adj.r.squared     sigma     loglik       aic       bic   deviance df.residual
## 1    0.01171183 1.8029707 -2006.3702 4022.7404 4047.2792 3237.70036         996
## 2   -0.00135115 0.2882131  -172.8794  355.7588  380.2975   82.73452         996
## 3    0.01714526 1.7980076 -2002.6087 4019.2173 4053.5716 3213.43441         994
## 4    0.01146384 0.2863629  -165.4342  344.8684  379.2227   81.51171         994
##   nobs message             adjvars adjvars.removed adjspec outcome_uid
## 1 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 2 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
## 3 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 4 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
##         outcome exposure_uid             adj_uid
## 1 Alpha-Lactose          age age_grp.2;age_grp.3
## 2 L-Lactic acid          age age_grp.2;age_grp.3
## 3 Alpha-Lactose      bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid      bmi_grp age_grp.2;age_grp.3
## 
## $Effects
##   run outcomespec exposurespec      term     estimate   std.error  statistic
## 1   1     lactose          age       age  0.034697534 0.024961296  1.3900534
## 2   2     lactate          age       age -0.002083854 0.003990177 -0.5222460
## 3   3     lactose      bmi_grp bmi_grp.2 -0.047905160 0.134163189 -0.3570663
## 4   3     lactose      bmi_grp bmi_grp.3  0.360610471 0.145954081  2.4707118
## 5   3     lactose      bmi_grp bmi_grp.4  0.420224133 0.577141138  0.7281133
## 6   4     lactate      bmi_grp bmi_grp.2  0.034207637 0.021367743  1.6009008
## 7   4     lactate      bmi_grp bmi_grp.3  0.080743228 0.023245640  3.4734783
## 8   4     lactate      bmi_grp bmi_grp.4 -0.126241151 0.091919426 -1.3733892
##        p.value
## 1 0.1648232433
## 2 0.6016151626
## 3 0.7211179261
## 4 0.0136512756
## 5 0.4667157218
## 6 0.1097166272
## 7 0.0005359045
## 8 0.1699410577
## 
## attr(,"ptime")
## [1] "Processing time: 0.15 sec"

6. Linear regression with glm

Run a linear regression using the glm function for the same variables as above. The default family used with glm is “gaussian”, which corresponds to a linear regression. The Effects data frame will be the same as with lm, but the ModelSummary data frame will contain some different columns.

  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="glm"))
  print(all.equal(lm_results$Effects, glm_results$Effects))
## [1] TRUE
  print(glm_results$ModelSummary)
##   run cohort        spec model outcomespec exposurespec converged wald.pvalue
## 1   1    DPP Interactive           lactose          age         1  0.16451267
## 2   2    DPP Interactive           lactate          age         1  0.60149903
## 3   3    DPP Interactive           lactose      bmi_grp         1  0.02387982
## 4   4    DPP Interactive           lactate      bmi_grp         1  0.00166282
##   null.deviance df.null     loglik       aic       bic   deviance df.residual
## 1    3285.93681     999 -2006.3702 4022.7404 4047.2792 3237.70036         996
## 2      82.87175     999  -172.8794  355.7588  380.2975   82.73452         996
## 3    3285.93681     999 -2002.6087 4019.2173 4053.5716 3213.43441         994
## 4      82.87175     999  -165.4342  344.8684  379.2227   81.51171         994
##   nobs message             adjvars adjvars.removed adjspec outcome_uid
## 1 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 2 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
## 3 1000         age_grp.2;age_grp.3                 age_grp   HMDB00186
## 4 1000         age_grp.2;age_grp.3                 age_grp   HMDB00190
##         outcome exposure_uid             adj_uid
## 1 Alpha-Lactose          age age_grp.2;age_grp.3
## 2 L-Lactic acid          age age_grp.2;age_grp.3
## 3 Alpha-Lactose      bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid      bmi_grp age_grp.2;age_grp.3

7. Logistic regression with glm

Call getModelData() to define a model which adjusts for age group, has nested_case as the outcome variable, and has lactose and lactate as the exposure variables. The variable nested_case must be a binary 0-1 variable.

  exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
                   rowvars="nested_case", colvars=c("lactose","lactate"))

To run a logistic regression, the list of options op must also include a model.options list with family set to “binomial”.

  op <- list(model="glm", model.options=list(family="binomial"))
  glm_results  <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)

8. Run Analysis on all models defined in the input Excell sheet (‘super-batch’ mode)

All models desginated in the input file can be run with one command, and individual output Excel files or correlation results will be written in the current directory by default. The function returns a list of objects.

 allresults <- COMETS::runAllModels(exmetabdata,writeTofile=F)
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.4-1     ellipsis_0.3.1       class_7.3-17        
##   [4] rprojroot_1.3-2      corpcor_1.6.9        fs_1.4.2            
##   [7] rstudioapi_0.11      farver_2.0.3         remotes_2.2.0       
##  [10] subselect_0.15.2     prodlim_2019.11.13   fansi_0.4.1         
##  [13] lubridate_1.7.9      codetools_0.2-16     splines_4.0.2       
##  [16] mnormt_2.0.1         knitr_1.29           pkgload_1.1.0       
##  [19] jsonlite_1.7.0       pROC_1.16.2          caret_6.0-86        
##  [22] broom_0.7.0          cluster_2.1.0        compiler_4.0.2      
##  [25] httr_1.4.1           backports_1.1.7      assertthat_0.2.1    
##  [28] Matrix_1.2-18        lazyeval_0.2.2       cli_2.0.2           
##  [31] htmltools_0.5.0      prettyunits_1.1.1    tools_4.0.2         
##  [34] gtable_0.3.0         glue_1.4.1           reshape2_1.4.4      
##  [37] dplyr_1.0.0          Rcpp_1.0.5           cellranger_1.1.0    
##  [40] vctrs_0.3.1          gdata_2.18.0         nlme_3.1-148        
##  [43] iterators_1.0.12     crosstalk_1.1.0.1    psych_1.9.12.31     
##  [46] timeDate_3043.102    gower_0.2.2          xfun_0.19           
##  [49] stringr_1.4.0        ps_1.3.3             testthat_2.3.2      
##  [52] lifecycle_0.2.0      gtools_3.8.2         devtools_2.3.1      
##  [55] dendextend_1.14.0    MASS_7.3-51.6        scales_1.1.1        
##  [58] ipred_0.9-9          TSP_1.1-10           parallel_4.0.2      
##  [61] RColorBrewer_1.1-2   yaml_2.2.1           memoise_1.1.0       
##  [64] heatmaply_1.1.1      gridExtra_2.3        ggplot2_3.3.2       
##  [67] rpart_4.1-15         stringi_1.4.6        gclus_1.3.2         
##  [70] desc_1.2.0           foreach_1.5.0        seriation_1.2-8     
##  [73] caTools_1.18.0       pkgbuild_1.0.8       lava_1.6.7          
##  [76] rlang_0.4.6          pkgconfig_2.0.3      bitops_1.0-6        
##  [79] evaluate_0.14        lattice_0.20-41      purrr_0.3.4         
##  [82] labeling_0.3         recipes_0.1.13       htmlwidgets_1.5.1   
##  [85] processx_3.4.3       tidyselect_1.1.0     plyr_1.8.6          
##  [88] magrittr_1.5         R6_2.4.1             gplots_3.0.4        
##  [91] generics_0.0.2       COMETS_1.5.0.0       pillar_1.4.6        
##  [94] withr_2.2.0          survival_3.1-12      nnet_7.3-14         
##  [97] tibble_3.0.3         crayon_1.3.4         KernSmooth_2.23-17  
## [100] tmvnsim_1.0-2        plotly_4.9.2.1       rmarkdown_2.5       
## [103] viridis_0.5.1        usethis_1.6.1        grid_4.0.2          
## [106] readxl_1.3.1         data.table_1.12.8    callr_3.4.3         
## [109] ModelMetrics_1.2.2.2 digest_0.6.25        webshot_0.5.2       
## [112] tidyr_1.1.1          ISwR_2.0-8           stats4_4.0.2        
## [115] munsell_0.5.0        registry_0.5-1       viridisLite_0.3.0   
## [118] sessioninfo_1.1.1